-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add allow_nan_equality option to assert_approx_df_equality #29
base: main
Are you sure you want to change the base?
Add allow_nan_equality option to assert_approx_df_equality #29
Conversation
The objective of the issue was to make allow_nan_equality available to assert_approx_df_equality, but on reviewing the code it seemed like a more efficient plan to include the precision option in assert_df_equality and make it optional. This results in only one set of functions needed to be maintained. I've moved the single assert_rows_equality function into row_comparer along with the custom exception which I've renamed. I've made one generic equality function which takes options. All the tests have been updated although there could be scope for a few more new tests - I wanted to consult on that first. This is a fairly large change to the API for which a major version update would be recommended.
Changed the section on assert approx df equality.
I've removed assert_approx_column_equality from the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some part where you call dtypes or schema to figure out which columns are floating / decimal point (and need approximate comparison) and which columns need exact equality comparison?
This refactoring looks promising. I need to understand more about if it causes any Python version limitations, backwards incompatibility issues, or performance drains compared to what we currently have. It's certainly more elegant, but this lib is already in production for a variety of users and want to make sure it's not going to be a breaking change for anyone. Thanks!
|
||
""" | ||
both_floats = (isinstance(x, float) & isinstance(y, float)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this cause we know the types from the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @MrPowers , can you have a look at my draft pull request here to see if I'm going along the right lines with your 3rd objective?
To summarise, rather than the following line in check_equal
:
both_floats = (isinstance(x, float) & isinstance(y, float))
I'm doing this:
is_float_type = (dtype_name in ['float', 'double', 'decimal'])
Where dtype gets passed into the function, and is created with the following line:
dtypes = [field.dataType.typeName() for field in df1.schema]
Because we've already compared the schemas, we know that they're equal so we just use df1.schema
.
Will using isinstance
really have a noticeable effect on the speed of the tests? I feel like it might be the safer option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did some further testing too, and my current solution would not work for pyspark columns with DecimalType
, as the python type is decimal.Decimal
not float
. I'd need to change that if the original implementation is selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time to give me a steer this week @MrPowers ? As in point me in the right direction for the solution (sorry realised giving a steer might be a localised phrase) - where are you from anyway?
@mitches-got-glitches - thanks for following up and sorry for the delay. Here are my main objectives with this PR:
Looking forward to driving this towards completion & appreciate your persistence! BTW - I'm a Python noob, so you're teaching me a lot! |
Python 3.6+ is what's currently supported. Spark 3.1 dropped support for Python 2.7, 3.4, and 3.5. |
Since the Python |
Looks like Spark 2.4.5 technically supports Python 3.4+ and I'd still like to support the Spark 2 users for a bit longer. That said, seems reasonable to set the Python version as >=3.5 for this project. Python 3.5 was released in September 2015 and think most users will at least be using Python 3.5, even the ones that are still using Spark 2. Sidenote, when chispa eventually drops support for PySpark 2, we should do a major release (i.e. bump chispa to 1.x). I will bump the Python version in the project now and add a section to explain this to users. |
To add another data point, my team is currently using Python 3.6.8 and Spark 2.4. |
@mitches-got-glitches - great, glad we're still supporting PySpark 2.4 😉 Updated the required Python version. So yes, the |
@mitches-got-glitches - BTW, see here for the chispa download statistics. If this lib didn't have any users yet, then I wouldn't really care about breaking API changes. Since there are a lot of users, that's why I'm also focusing on backwards compatibility. A blessing and a curse I suppose 😉 |
Yeah, those are some good numbers! I think what you've set out to do is exactly right with the change to the readme and intention to launch 1.x. Users can always open a ticket for backwards compatibility if it's really necessary, or even create their own fork. But I agree, try not to get on their bad side ha. So hopefully, once the thread above is resolved, then I think all that will be required is an update to the screenshots in the readme if you're happy with everything else @MrPowers ? |
@mitches-got-glitches - Thanks for following up and sorry I've been delayed responding. I've been working on my first Dask PR and it's been hard for me to get up-and-running on a completely new framework. I'm still a Python n00b. My initial hesitation for merging this PR was that it wasn't 100% backwards compatible. Is that still the case? Or does the current PR introduce some backwards incompatible changes? |
Hello, sorry for the slow reply. I think in terms of backwards compatibility there were two things:
I believe this can be done by doing something like the following with a deprecation warning so users know to switch their code base over to def assert_approx_df_equality(*args, **kwargs):
assert_df_equality(*args, **kwargs) Does that look like a good idea? There are two more outstanding things I need your help on before this can come out of draft status:
Are you able to support me with those @MrPowers ? |
This pull request solves the issue by making a fairly big change to the API. Now, rather than having two assertion functions for both DataFrame and column comparison, there is just one for each. Both these assertion functions have an optional parameter for
precision
, meaning theassert_approx_df_equality
andassert_approx_column_equality
could be removed.In addition to solving the ticket, I feel like I made a number of improvements to the code base:
numpy
docstring conventionActions for @MrPowers :
RowsNotEqualError
and consider any implicationsWhen making a new release:
Let me know what you think!
Closes #28